Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM

نویسندگان

  • Robbie Haertel
  • Peter McClanahan
  • Eric K. Ringger
چکیده

We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid wordand consonant-level conditional Markov model. Our approach rivals the best previously published results in Arabic (15% WER with case endings), without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic

In this paper we present a system for automatic Arabic text diacritization using three levels of analysis granularity in a layered back off manner. We build and exploit diacritized language models (LM) for each of three different levels of granularity: surface form, morphologically segmented into prefix/stem/suffix, and character level. For each of the passes, we use Viterbi search to pick the ...

متن کامل

Maximum entropy modeling for diacritization of Arabic text

We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is exactly defined using 13 diacritics and a null-diacritic and covers all the diacritics used in any Arabic text. The state emission probabilities are estimated ...

متن کامل

Diacritization as a Machine Translation Problem and as a Sequence Labeling Problem

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text: First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word and character-based models separately and combined as well as a model which uses statistical machine translation (SMT) to post-edit a rule-b...

متن کامل

Exploiting Arabic Diacritization for High Quality Automatic Annotation

We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for ...

متن کامل

Diacritization for Real-World Arabic Texts

For Arabic, diacritizing written text is important for many NLP tasks. In the work presented here, we investigate the quality of a diacritization approach, with a high success rate for treebank data but with a more limited success on realworld data. One of the problems we encountered is the non-standard use of the hamza diacritic, which leads to a decrease in diacritization accuracy. If an auto...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010